Web Genre Benchmark Under Construction
Abstract
The project discussed in this article focuses on the creation of web genre benchmarks (a.k.a. web genre reference corpora or web genre test collections), i.e. newly conceived test collections against which it will be possible to judge the performance of future genre-enabled web applications. The creation of web genre benchmarks is of key importance for the next generation of web applications because, at present, it is impossible to evaluate existing and in-progress genre-enabled prototypes. We suggest focusing on four key points: 1) proposing a characterisation of genre suitable for digital environments and empirical approaches shared by a number of genre experts working in automatic genre identification; 2) defining the criteria for the construction of web genre benchmarks and drawing up annotation guidelines; 3) creating several web genre benchmarks in several languages; 4) validating the methodology and evaluating the results.
Similar resources
Genre Classification of Web Pages
Genre classification means to discriminate between documents by means of their form, their style, or their targeted audience. Put another way, genre classification is orthogonal to a classification based on the documents’ contents. While most of the existing investigations of an automated genre classification are based on news articles corpora, the idea here is applied to arbitrary Web pages. W...
Common Criteria for Genre Classification: Annotation and Granularity
In this paper, we present two experiments that use machine learning for automatically classifying web pages by genre. These experiments highlight the influence that genre annotation and genre granularity can have on the accuracy of the classification. From a practical point of view these experiments show that a collection annotated with the criteria of ‘objective sources’ and consistent genre gr...
Leveraging Website Genre and Structure Information for Fake Website Detection
In this study we assessed the efficacy of using website genre composition and design structure information for fake website detection. A genre tree kernel was proposed that creates a rooted tree from the website file directory structure, and labels the tree’s file nodes with genre information. The genre tree kernel was compared against several benchmark kernel and non-kernel methods that utiliz...
Refined and Incremental Centroid-based approach for Genre Categorization of Web pages
In this paper, I propose a refined and incremental centroid-based approach for genre categorization of web pages. My approach is based on the construction of genre centroids using a set of training web pages. These centroids will be used to classify new web pages. The originality of my approach is the implementation of two new aspects, which are refining and incrementing. My approach is based o...
Web Genre Analysis: Use Cases, Retrieval Models, and Implementation Issues
People who search the World Wide Web often have a multifaceted understanding of their information need: they know what they are searching for, and they know of which form or type the desired documents should be. The former aspect relates to the content of a desired document (= topic), the latter to the presentation of its content and the intended target group. Due to the different user groups a...
Journal title: JLCL
Volume: 24
Publication date: 2009